PISA 2012 Exploration Analysis

by Gabriel Barros

Preliminary Wrangling

This document explores a dataset containing information on approximately 490,000 students who took part in PISA 2012.

Gather

Assess

PISA 2012

World (geopandas)

Does not exist in the world dataframe:

Need to change:

Assessments

Clean

Column names aren't intuitive.

Define

Rename the column names using the rename function.

Code
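A minimal sketch of the renaming step, assuming the data lives in a pandas DataFrame called pisa (a hypothetical name) and using a small stand-in frame with only four of the columns. The code-to-name pairs are the ones listed in this document.

```python
import pandas as pd

# Toy stand-in for the PISA 2012 frame; the real data has many more columns.
pisa = pd.DataFrame({
    "SCHOOLID": [1, 2],
    "STIDSTD": [10, 11],
    "ST04Q01": ["Female", "Male"],
    "ST28Q01": ["0-10 books", "11-25 books"],
})

# Map the original PISA codes to the readable names used in this document.
column_names = {
    "SCHOOLID": "school_id",
    "STIDSTD": "student_id",
    "ST04Q01": "gender",
    "ST28Q01": "books",
}

# rename returns a new DataFrame; columns missing from the mapping are kept as-is.
pisa = pisa.rename(columns=column_names)
print(pisa.columns.tolist())
```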

Test

Some country names in the PISA 2012 dataset differ from the names used in the world dataframe.

Define

Replace the mismatched country names so that both datasets use the same names.

Code
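A minimal sketch of how the replacement could be done with a mapping dictionary. The column name country, the example country spellings, and the mapping itself are hypothetical placeholders, not the actual list of mismatches in the data.

```python
import pandas as pd

# Toy stand-ins; names and spellings below are illustrative assumptions only.
pisa = pd.DataFrame({"country": ["Korea", "Brazil", "Russian Federation"]})
world_names = {"South Korea", "Brazil", "Russia"}

# Hypothetical mapping: spelling in the PISA data -> spelling in the world dataframe.
name_fixes = {"Korea": "South Korea", "Russian Federation": "Russia"}
pisa["country"] = pisa["country"].replace(name_fixes)

# Any name still missing from the world dataframe shows up here.
unmatched = set(pisa["country"]) - world_names
print(unmatched)
```

Checking the set difference afterwards is a quick way to confirm that no mismatched names remain.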

Test

Missing categorical values in the ST27Q01 (phones), ST27Q02 (televisions), ST27Q03 (computers), ST27Q04 (cars), and ST28Q01 (books) columns.

Define

Since the number of missing values in the 5 categorical columns is small compared to the size of the DataFrame, I will drop the rows with missing values in those columns.

Code
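A minimal sketch of the drop, assuming the columns have already been renamed and the frame is called pisa (a hypothetical name). Note that the category "None" (no device at home) is an ordinary string and is not treated as missing.

```python
import numpy as np
import pandas as pd

# Toy data: rows 1 and 2 each have one truly missing value (NaN).
pisa = pd.DataFrame({
    "phones": ["One", np.nan, "Two", "None"],
    "televisions": ["Two", "One", "One", "One"],
    "computers": ["None", "One", np.nan, "Two"],
    "cars": ["One", "One", "One", "Two"],
    "books": ["0-10 books", "11-25 books", "26-100 books", "0-10 books"],
})

cat_cols = ["phones", "televisions", "computers", "cars", "books"]

# Drop a row if any of the five categorical columns is missing.
pisa_clean = pisa.dropna(subset=cat_cols)
print(len(pisa), "->", len(pisa_clean))
```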

Test

Missing numeric values in the AGE (age), BFMJ2 (father_isei), BMMJ1 (mother_isei), HISEI (highest_isei), and PARED (highest_parents_educ) columns.

Define

Fill the missing values in the numeric columns with the average of that variable for the corresponding country.

Code
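A minimal sketch of a per-country mean fill, assuming a frame called pisa with a country column (both names are assumptions) and showing only two of the numeric columns. groupby(...).transform("mean") returns a frame aligned row by row with the original, which fillna can then use.

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two numeric columns.
pisa = pd.DataFrame({
    "country": ["Brazil", "Brazil", "Brazil", "Japan", "Japan"],
    "age": [15.5, np.nan, 16.0, 15.8, np.nan],
    "father_isei": [30.0, 50.0, np.nan, np.nan, 60.0],
})

num_cols = ["age", "father_isei"]

# Mean of each column within the row's country, repeated for every row.
country_means = pisa.groupby("country")[num_cols].transform("mean")

# Only the missing cells are replaced; observed values stay untouched.
pisa[num_cols] = pisa[num_cols].fillna(country_means)
print(pisa)
```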

Test

The month of birth column (ST03Q01) has an impossible value (month 99).

Define

Drop the rows with the inaccurate value in the month_birth column (month 99).

Code
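A minimal sketch of this filter on a toy frame, assuming the renamed month_birth column and a frame called pisa (a hypothetical name).

```python
import pandas as pd

# Toy data containing the impossible month code 99.
pisa = pd.DataFrame({"month_birth": [1, 99, 12, 7]})

# Keep only the rows whose month is not the invalid code.
pisa = pisa[pisa["month_birth"] != 99]
print(sorted(pisa["month_birth"].unique()))
```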

Test

The HISEI (highest_isei) column should hold the higher value of BFMJ2 (father_isei) and BMMJ1 (mother_isei).

Define

Rebuild the highest_isei column after handling the missing values in the mother_isei and father_isei columns.

Code
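A minimal sketch of the rebuild, assuming both parent columns are already free of missing values and the frame is called pisa (a hypothetical name). A row-wise maximum over the two columns gives the higher of the two values.

```python
import pandas as pd

# Toy data: parental ISEI after the missing values were filled.
pisa = pd.DataFrame({
    "father_isei": [30.0, 70.0, 45.0],
    "mother_isei": [50.0, 40.0, 45.0],
})

# Row-wise maximum of the two parent columns.
pisa["highest_isei"] = pisa[["father_isei", "mother_isei"]].max(axis=1)
print(pisa)
```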

Test

Categories in the column ST28Q01 (books) are badly formatted.

Define

Strip trailing whitespace in the books column.

Code
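A minimal sketch using the pandas string accessor, assuming the books column holds plain strings and the frame is called pisa (a hypothetical name). The padded values below are illustrative.

```python
import pandas as pd

# Toy data: some categories carry trailing spaces.
pisa = pd.DataFrame({
    "books": ["0-10 books  ", "11-25 books ", "More than 500 books"],
})

# rstrip removes whitespace on the right side only.
pisa["books"] = pisa["books"].str.rstrip()
print(pisa["books"].tolist())
```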

Test

Erroneous data type:

- SCHOOLID (school_id) and STIDSTD (student_id): string
- OECD (OECD_country) and ST04Q01 (gender): categorical (nominal)
- ST27Q01 (phones), ST27Q02 (televisions), ST27Q03 (computers), ST27Q04 (cars), and ST28Q01 (books): categorical (ordinal)

Define

Convert school_id and student_id columns to string, convert OECD_country and gender columns to nominal data type, and convert phones, televisions, computers, cars, and books columns to ordinal data type.

Code
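A minimal sketch of the three conversions on a toy frame. The frame name pisa and the OECD_country labels are assumptions. The category orders are the ones described in this document, and only two of the five ordinal columns are shown.

```python
import pandas as pd

pisa = pd.DataFrame({
    "school_id": [1, 2, 3],
    "student_id": [101, 102, 103],
    "OECD_country": ["OECD", "Non-OECD", "OECD"],  # assumed labels
    "gender": ["Female", "Male", "Female"],
    "phones": ["One", "Three or more", "None"],
    "books": ["0-10 books", "26-100 books", "More than 500 books"],
})

# Identifiers are labels, not quantities, so store them as strings.
for col in ["school_id", "student_id"]:
    pisa[col] = pisa[col].astype(str)

# Nominal variables: categories without an order.
for col in ["OECD_country", "gender"]:
    pisa[col] = pisa[col].astype("category")

# Ordinal variables: categories with an explicit order, lowest to highest.
count_levels = ["None", "One", "Two", "Three or more"]
book_levels = ["0-10 books", "11-25 books", "26-100 books",
               "101-200 books", "201-500 books", "More than 500 books"]
pisa["phones"] = pisa["phones"].astype(
    pd.CategoricalDtype(categories=count_levels, ordered=True))
pisa["books"] = pisa["books"].astype(
    pd.CategoricalDtype(categories=book_levels, ordered=True))

print(pisa.dtypes)
```

With ordered categories, plots and comparisons follow the declared order instead of alphabetical order.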

Test

Store

What is the structure of your dataset?

There are 463,020 students in the dataset with 32 features. Most variables are numeric, but there are 5 ordinal categorical variables (phones, televisions, computers, cars, and books) and 2 nominal categorical variables (OECD_country and gender).

Ordinal Categorical Variables:
(worst) ——> (best)
phones, televisions, computers, and cars: None, One, Two, Three or more
books: 0-10 books, 11-25 books, 26-100 books, 101-200 books, 201-500 books, More than 500 books

What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out what features are best correlated with the plausible values (pv{1-5}_math, pv{1-5}_read, and pv{1-5}_scie), i.e., the academic performance.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that the highest parental education in years feature (highest_parents_educ) will have the strongest effect on each plausible value. I also think that country and the highest parental ISEI (highest_isei), including father and mother ISEI, will have significant effects.

Univariate Exploration

I'll start by looking at the distribution of the main variables of interest: pv_math, pv_read, and pv_scie.

The plausible values are approximately normally distributed, with the peak between 400 and 600, at approximately 500. In each subject (math, reading, and science), the distributions of the five plausible values (1 through 5) are very similar.

Next up, let's see the first 3 predictor variables of interest: father_isei, mother_isei, and highest_isei.

PISA's International Socio-Economic Index of Occupational Status (ISEI) captures the attributes of occupations that convert parents' education into income. The mother's ISEI distribution peaks between 40 and 50, while the father's peaks between 20 and 30. The highest parental ISEI distribution follows the mother's, also peaking between 40 and 50.

Now, I'll look at the highest parental education in years variable (highest_parents_educ).

The highest_parents_educ variable is skewed to the left and has two peaks, the first at 12.5 years and the second at 15 years.

I'll now look at the last three numeric variables: age, month_birth, and year_birth.

The age variable has a bimodal distribution, with the first peak between 15.4 and 15.6 and the second between 16 and 16.2. The month of birth variable is close to uniform, i.e., the number of students doesn't change much from month to month. Finally, the year of birth variable shows a large disparity between the number of students born in 1996 and the number born in 1997.

Now, I'll look at the categorical variables. To begin with, I'll start with the nominal variables (OECD_country and gender).

There are many more students from OECD countries than from non-OECD countries. Meanwhile, the difference between the number of male and female students is small, but there are more female students in the dataset.

Next up, let's look at the ordinal categorical variables (phones, televisions, computers, and cars).

For phones and televisions, the number of students grows with each category, from none to three or more. Meanwhile, more students have one computer at home than two, but most students have three or more. Cars is the only ordinal variable for which most students have just one at home.

Finally, let's see the last ordinal variable: books.

The category with the highest number of students is the middle one (26-100 books). Moreover, most students are in the first three categories, i.e., with 100 books or fewer at home.

Finally, I'll look at the last variable: country.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

All the plausible values, for each subject (math, reading, and science), are approximately normally distributed, with the peak between 400 and 600, at approximately 500.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

When investigating the month of birth variable, some outliers were identified (month 99). They were added to the assessment list, and then cleaned.

Bivariate Exploration

To start, I'll look at pairwise correlations present between features in the data. As shown in the Univariate Exploration, the distributions of the 5 plausible values for each subject (math, reading and science) are very similar to each other, so, to simplify, I'll show the relationship between the numeric variables and the first plausible value for each subject.

In order to make the plots clearer and render faster, I'll use a sample of 1,000 students.
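A minimal sketch of the sampling and the pairwise correlation step. The data here is synthetic, generated only so that the block runs on its own. The frame name pisa and the column names pv1_math, pv1_read, and pv1_scie are assumptions based on the naming used in this document.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, not PISA results.
rng = np.random.default_rng(0)
n = 5000
isei = rng.uniform(10, 90, n)
ability = rng.normal(0, 80, n)  # shared component so the three scores correlate

pisa = pd.DataFrame({
    "highest_isei": isei,
    "pv1_math": 400 + 1.5 * isei + ability + rng.normal(0, 20, n),
    "pv1_read": 400 + 1.5 * isei + ability + rng.normal(0, 20, n),
    "pv1_scie": 400 + 1.5 * isei + ability + rng.normal(0, 20, n),
})

# A fixed random_state makes the sample reproducible between runs.
sample = pisa.sample(n=1000, random_state=42)

# Pearson correlation between every pair of numeric columns.
corr = sample.corr()
print(corr.round(2))
```

The resulting matrix is what a correlation heatmap would display. Sampling mostly helps the scatter plots, since computing the matrix on the full data is cheap.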

As expected, the plausible values have strong positive relationships with each other, and the highest parental ISEI and the highest parental education in years have a weak to moderate positive relationship with the plausible values.

The month of birth has a weak negative correlation with each plausible value. Meanwhile, the age variable has a weak positive relationship.

Now, let's look at how the categorical variables are related to the plausible values. Again, I'll only consider the first plausible value in each subject.

For each categorical variable, the median of the plausible value increases from one category to the next. The only exception is the number of cars, where the median is highest for students who have two cars at home.

Now, let's see the distribution of these categorical variables with a violin plot.

Again, the plausible value clearly tends to increase along the ordinal categories, except for the cars variable, as seen earlier.

Next up, I'll see how the two nominal variables are related to the plausible values.

The OECD countries have a higher median plausible value than the non-OECD countries in all subjects. Meanwhile, gender doesn't have a significant impact on the plausible values.

And, again, let's see these two nominal variables in a violin plot.

The plots confirm that OECD countries have higher values than non-OECD countries in every subject.

With the preliminary look at the bivariate relationships out of the way, I want to dig deeper into some of them. First, I want to see how the first plausible value of each subject (math, science, and reading) is related to the highest parental ISEI. In order to make the plots clearer and render faster, I'll use a sample of 10,000 students.

These plots suggest that the plausible values increase as the parental ISEI increases. This was already seen in the correlation heatmap shown earlier.

Another variable that I want to see in detail is the highest parental education in years, and in order to make the plots clearer and render faster, I'll, again, use a sample of 10,000 students.

And, again, the plausible values clearly increase with the years of the highest parental education.

The correlation heatmap shows a strong positive correlation between the highest parental education in years and the highest parental ISEI, with a correlation coefficient of 0.5. Now, let's look at the relationships between the highest parental ISEI and the categorical variables.

For the ordinal variables, the median of the parental ISEI increases with each step up in the category order.

Let's see how the nominal variables are related to the parental ISEI.

The median for OECD countries is higher than for non-OECD countries. Meanwhile, there isn't a significant difference between female and male students.

Now, I'll explore the relationships between the highest parental education in years and the ordinal variables.

Among the variables above, the number of books and the number of computers stand out for the big increase in the median in the last category.

Let's see how the nominal variables are related to the highest parental education in years.

And, again, the median for OECD countries is higher than for non-OECD countries. Meanwhile, there isn't a significant difference between female and male students.

Now, let's look at how the variable country influences the plausible values.

As shown above, the same countries have high averages in all three subjects. Moreover, dark colors, which indicate high values, predominate in Europe, Oceania (Australia and New Zealand), North America (the USA and Canada), China, Japan and South Korea.

To explore how the continents influence the academic performance of students who took PISA 2012, let's see the same type of visualization as above, but by continent.

As in the world maps by country, the average of the plausible values by subject and continent shows Europe and Oceania standing out.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As expected, the plausible values have strong positive relationships with each other. Moreover, the highest parental ISEI and the highest parental education in years have a weak to moderate positive relationship with the plausible values. On the other hand, the month of birth has a weak negative correlation with each plausible value. Meanwhile, the age variable has a weak positive relationship.

There are also some interesting relationships between the plausible values and the ordinal variables (cars, phones, televisions, computers, and books), where the median of the plausible value increases from one ordinal category to the next. The only exception is the cars variable, where the median is higher for students who have two cars at home than for those with three or more.

Another relationship that proved to be interesting is the one between the plausible values and the countries. The same countries have high averages in all three subjects (math, reading, and science). Moreover, in the world maps, dark colors, which indicate high values, predominate in Europe, Oceania (Australia and New Zealand), North America (the USA and Canada), China, Japan and South Korea.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The correlation heatmap shows a strong positive correlation between the highest parental education in years and the highest parental ISEI, with a correlation coefficient of 0.5. Looking at the relationship between the highest parental ISEI and the ordinal variables, the median of the parental ISEI increases with each step up in the category order. This trend was also seen between the highest parental education in years and the ordinal variables, especially for the books and computers variables, where there is a big increase in the median in the last category.

Multivariate Exploration

The main thing I want to explore in this part is how the categorical variables books and computers play into the relationship between the parents' situation (highest parental ISEI and highest parental education in years) and the plausible values that their children obtained in PISA 2012. In order to make the plots clearer and render faster, I'll use a sample of 1,000 students.

In each of the faceted scatter plots for the three subjects (math, reading, and science), you can see how the number of books at home relates to the highest parental ISEI and the first plausible value. As the number of books at home increases, the slope of the regression line increases too.
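One way to check the slope comparison numerically is to fit a straight line per category. The sketch below uses made-up numbers (not PISA results) chosen so that the slopes are easy to verify by hand. The frame name pisa and the column pv1_math are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative numbers only: the second group rises twice as fast with ISEI.
pisa = pd.DataFrame({
    "books": ["0-10 books"] * 3 + ["26-100 books"] * 3,
    "highest_isei": [20.0, 40.0, 60.0, 20.0, 40.0, 60.0],
    "pv1_math": [400.0, 420.0, 440.0, 450.0, 490.0, 530.0],
})

# Least-squares line (degree 1) of score against ISEI within each category.
slopes = {}
for level, group in pisa.groupby("books"):
    slope, intercept = np.polyfit(
        group["highest_isei"].to_numpy(), group["pv1_math"].to_numpy(), deg=1)
    slopes[level] = slope

print(slopes)
```

A steeper slope in a higher category means the score rises faster with parental ISEI for students in that category.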

Now, let's see the same visualization, but with the number of computers at home instead of the number of books.

Reproducing the same plots with the number of computers at home, instead of the number of books, shows the same trend: the slope of the regression line increases as the number of computers at home increases.

Now, let's see the same visualizations, but with the highest parental education in years instead of the highest parental ISEI.

In each of the faceted scatter plots of the three subjects (math, reading, and science) in both categorical variables, computers and books, as the level of the ordinal variable increases, the slope of the regression line increases too.

Let's move on to looking at how being an OECD country impacts the relationship between the plausible values and the highest parental occupational status. In order to make the plots clearer and render faster, I'll use a sample of 1,000 students.

In each of the three scatter plots, there is a greater concentration of OECD countries in the top right corner, which corresponds to higher plausible values and higher parental ISEI.

Now, let's see the same visualizations, but with the highest parental education in years instead of the highest parental ISEI.

Reproducing the same plots with the highest parental education in years, instead of the highest parental occupational status, shows, again, a greater concentration of OECD countries in the top right corner, which corresponds to higher plausible values and more years of parental education.

Now, let's see how the relationship between the average of the plausible values and the highest parental ISEI is influenced by the number of books at home.
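A minimal sketch of how such a summary could be built: average the plausible values per student, bin the parental ISEI, and take the mean per books category and bin. The numbers are made up for illustration, only two plausible values are used instead of five, and the frame name, column names, and bin edges are assumptions.

```python
import pandas as pd

# Illustrative numbers only, not PISA results.
pisa = pd.DataFrame({
    "pv1_math": [400, 420, 500, 520, 450, 470],
    "pv2_math": [410, 430, 510, 530, 460, 480],
    "highest_isei": [20, 70, 25, 75, 30, 80],
    "books": ["0-10 books", "0-10 books", "26-100 books",
              "26-100 books", "11-25 books", "11-25 books"],
})

# Average of the plausible values for each student.
pisa["pv_math_mean"] = pisa[["pv1_math", "pv2_math"]].mean(axis=1)

# Bin the parental ISEI so each line has a few points on the x-axis.
pisa["isei_bin"] = pd.cut(pisa["highest_isei"], bins=[10, 50, 90])

# Mean score per books category (rows) and ISEI bin (columns).
summary = (
    pisa.groupby(["books", "isei_bin"], observed=True)["pv_math_mean"]
    .mean()
    .unstack("isei_bin")
)
print(summary)
```

Each row of the summary corresponds to one line in a line plot, with the ISEI bins along the x-axis.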

Students who have 10 or fewer books at home have the lowest averages in each of the subjects, regardless of the parental ISEI.

Next, I'll use the same approach as before, but, instead of the number of books at home, I'll use the number of computers.

Students who don't have a computer at home have the lowest averages in each of the subjects, regardless of the parental ISEI.

Now, instead of using the highest parental ISEI, I'll use the highest parental education in years.

Again, regardless of the parents' years of education, students who have 10 or fewer books at home have the lowest averages in each subject.

Next, I'll use the same approach as before, but, instead of the number of books at home, I'll use the number of computers.

And again, students who don't have a computer at home have the lowest averages in every subject, regardless of their parents' years of education.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I extended my investigation into how the categorical variables, books and computers, play into the relationship between the parents' situation (highest parental ISEI and highest parental education in years) and the plausible values that their children obtained in PISA 2012. The multivariate exploration showed that a larger number of books or computers at home does have a positive effect on the plausible values as the highest parental ISEI or the highest parental education in years increases.

Looking at how being an OECD country impacts the relationship between the plausible values and the highest parental ISEI or the highest parental education in years shows a greater concentration of OECD countries in the top right corners of the scatter plots, which corresponds to higher plausible values and higher parental ISEI or more years of parental education.

Finally, I investigated how the relationship between the average of the plausible values and the highest parental ISEI or the highest parental education in years is influenced by the number of computers or books at home. Regardless of the increase in the two numeric variables, the average of the plausible values is lower for students who don't have a computer or have 10 or fewer books at home.

Were there any interesting or surprising interactions between features?

Looking back on the line plots, students who have between 11 and 25 books seem to perform better at school than students who have many more books. This was a surprise to me.

References